dplyr
packagehelp() to look up function
documentationNow that you’ve learned the basics of R programming, we’ll take things a step further and start working on your skills related to data analysis. You will likely be unfamiliar with some of the operations you need to execute in this exercise. Part of the goal with this exercise, however, is for you to become more familiar with the help commands in R and with the internet solutions that exist. Our ultimate goal is to make you aware of the tools that are available so that you can become an effective problem solver, working independently on data analyses.
So far, we’ve run code in the Console. This is fine for quick queries. For anything to be shareable, reproducible, and to reduce our own efforts in the long-term, we should run code via a script. An R script is a just a simple text file. R-Studio uses the script by copying R commands from text in the file and pastes them into the Console as if you were manually entering the commands yourself. This greatly enhances our ability to build off what we have created in the past, learn from previous experience, and quickly re-run analyses when new data are received. To create an R script:
R-Studio will open your R script automatically after creating it. Notice that the scripting window appears above the Console in what’s known as the Source pane.
Scripts folder. Name this file
IntroDataMgmt.R.
At the top
of your script, provide brief information that describes the content of
your script. The content is up to you, but should briefly identify who
created the file, when the file was created, and what the script does.
This will help when you return to the file at a later date or if you
decide to share the file with a colleague in the future. Remember that
anything after the # symbol is a comment. Use this symbol
to make your code more readable, similar to below.
# ******************************************************************
# ******************************************************************
# Project: Introduction to Data Analysis in R
# Description: A script which details some basic commands on how to manipulate data
# Author: <Your Name>
# Date Initialized: <dd month yyyy>
# ******************************************************************
# ******************************************************************
Running code via an R script is different than running code in the Console. To interpret and run the code you’ve written, R needs you to send the code from the script to the Console. Some common ways to run your code include:
Ctrl + Enter (Windows) or command
+ return (Mac).Run button
in the top right of the Source pane with your mouse.You’ll find that your Environment (Workspace) in the upper right panel will quickly become full with user-defined objects. It’s generally good practice to work with a clean Workspace when starting a session. I generally start all my scripts with the following command to make sure you are starting fresh, something we will do to help develop good programming practices and reduce clutter.
# Clean your workspace/remove all objects
rm(list=ls())
# You can also remove a specific dataset using the following command
#rm(dataset)
DplyrThe most basic R skill is to query and manipulate data tables. As a beginner programmer, it is imperative to familiarize yourself with how to manipulate data. Reinforcing these skills is like expanding your vocabulary in the new language that you are learning and is a great way to improve your R proficiency. If you wish to become really good at R, but don’t know where to start, start with data table manipulation!
The base R functions that come with the default R installation have
the capacity for almost all the table manipulation needs (e.g.,
split(), subset(), apply(), sapply(), lapply(), tapply(), aggregate()).
However, sometimes their syntax are less user-friendly and intuitive
than some of the special packages built for table manipulation purposes.
So, here we are introducing a few of the most useful table manipulation
functions within dplyr package.
Note that you will have to use install.packages() and
library() function to download and activate the
dplyr before using it. You only need to install the package
once on your computer. You will need, however, to ‘activate’ the package
any time you want to use the functions that exist within the
package.
#install.packages("dplyr")
library(dplyr)
R has multiple functions for reading in table data. Here we’ll use
the base function read.csv() to import a table named
panda_data.csv that is located in your Data
folder. Text files (.txt) can be imported using the
function read.delim(). See the help files for each function
and search Google for information on other functions to read other data
types.
View the first few rows of the data table using the function
head() or click on the dataframe in the Environment/History
panel.
# Read dataset
panda_data <- read.csv(file="Data/panda_data.csv")
# Look at the data
head(panda_data)
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 lan_lan 4416073 7 95 f WL 80 NA
## 4 bei_bei 4416405 5 120 f WL 80 NA
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## genetic_value3 year
## 1 60 2017
## 2 90 2017
## 3 80 2017
## 4 65 2017
## 5 76 2017
## 6 87 2017
Questions:
head() function?# Print the first 3 rows
head(panda_data, n = 3)
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 lan_lan 4416073 7 95 f WL 80 NA
## genetic_value3 year
## 1 60 2017
## 2 90 2017
## 3 80 2017
# Print rows 4 thru 6
panda_data[4:6,]
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 4 bei_bei 4416405 5 120 f WL 80 NA
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## genetic_value3 year
## 4 65 2017
## 5 76 2017
## 6 87 2017
# Dataframe dimensions
dim(panda_data)
## [1] 10 10
nrow(panda_data)
## [1] 10
ncol(panda_data)
## [1] 10
# Print the tail
tail(panda_data)
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## 7 wei_wei 4424967 7 98 f CD 60 NA
## 8 shuang_shuang 4424657 8 110 m WL 92 NA
## 9 qiuqiu 4423758 9 130 f CD 81 50
## 10 lang_lang 4427758 10 110 m WL 60 NA
## genetic_value3 year
## 5 76 2017
## 6 87 2017
## 7 92 2017
## 8 50 2017
## 9 30 2018
## 10 70 2018
The function select() is a powerful tool for selecting
columns of interest. You must specify the dataset you want to query and
then also provide an expression for selecting columns of interest
(select(.data, expression)). A few examples are provided
below:
# select column called panda_name
select(panda_data, panda_name)
## panda_name
## 1 da_da
## 2 mao_mao
## 3 lan_lan
## 4 bei_bei
## 5 bao_bao
## 6 tian_tian
## 7 wei_wei
## 8 shuang_shuang
## 9 qiuqiu
## 10 lang_lang
# select all columns in the data except panda_name
select(panda_data, -panda_name)
## ID age weight_kg sex base genetic_value1 genetic_value2 genetic_value3
## 1 4415463 5 100 m CD 99 50 60
## 2 4415522 4 120 f CD 70 30 90
## 3 4416073 7 95 f WL 80 NA 80
## 4 4416405 5 120 f WL 80 NA 65
## 5 4417779 5 110 m WL 75 60 76
## 6 4424490 4 89 m WL 95 70 87
## 7 4424967 7 98 f CD 60 NA 92
## 8 4424657 8 110 m WL 92 NA 50
## 9 4423758 9 130 f CD 81 50 30
## 10 4427758 10 110 m WL 60 NA 70
## year
## 1 2017
## 2 2017
## 3 2017
## 4 2017
## 5 2017
## 6 2017
## 7 2017
## 8 2017
## 9 2018
## 10 2018
# select a range of columns, from age to sex
select(panda_data, age:sex)
## age weight_kg sex
## 1 5 100 m
## 2 4 120 f
## 3 7 95 f
## 4 5 120 f
## 5 5 110 m
## 6 4 89 m
## 7 7 98 f
## 8 8 110 m
## 9 9 130 f
## 10 10 110 m
Various selection helpers also exist, including:
starts_with: Expression select multiple columns that
start with the same text.ends_with(): Expression to select columns that end with
the same text.contains(): Expression to select columns that contain
the same text.matches(): Expression to select columns that match a
regular expression.one_of(): Expression to select columns that are from a
group of names.# select all columns that start with "genetic" in their column names
select(panda_data, starts_with("genetic"))
## genetic_value1 genetic_value2 genetic_value3
## 1 99 50 60
## 2 70 30 90
## 3 80 NA 80
## 4 80 NA 65
## 5 75 60 76
## 6 95 70 87
## 7 60 NA 92
## 8 92 NA 50
## 9 81 50 30
## 10 60 NA 70
Questions:
panda_name and
age.# Select all the columns that contain "value" in the column name.
select(panda_data, contains("value"))
## genetic_value1 genetic_value2 genetic_value3
## 1 99 50 60
## 2 70 30 90
## 3 80 NA 80
## 4 80 NA 65
## 5 75 60 76
## 6 95 70 87
## 7 60 NA 92
## 8 92 NA 50
## 9 81 50 30
## 10 60 NA 70
# Select or exclude two columns: `panda_name` and `age`.
select(panda_data, c(panda_name, age))
## panda_name age
## 1 da_da 5
## 2 mao_mao 4
## 3 lan_lan 7
## 4 bei_bei 5
## 5 bao_bao 5
## 6 tian_tian 4
## 7 wei_wei 7
## 8 shuang_shuang 8
## 9 qiuqiu 9
## 10 lang_lang 10
select(panda_data, !c(panda_name, age))
## ID weight_kg sex base genetic_value1 genetic_value2 genetic_value3 year
## 1 4415463 100 m CD 99 50 60 2017
## 2 4415522 120 f CD 70 30 90 2017
## 3 4416073 95 f WL 80 NA 80 2017
## 4 4416405 120 f WL 80 NA 65 2017
## 5 4417779 110 m WL 75 60 76 2017
## 6 4424490 89 m WL 95 70 87 2017
## 7 4424967 98 f CD 60 NA 92 2017
## 8 4424657 110 m WL 92 NA 50 2017
## 9 4423758 130 f CD 81 50 30 2018
## 10 4427758 110 m WL 60 NA 70 2018
Filter() is similar to select(), except
that you are selecting specific rows that satisfy a requirement based on
a column value. This function is very similar to the base function
subset(). A few examples on how to use
filter() are provided below:
# Select rows where pandas are greater than or equal to 5 years of age
filter(panda_data, age >= 5)
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 lan_lan 4416073 7 95 f WL 80 NA
## 3 bei_bei 4416405 5 120 f WL 80 NA
## 4 bao_bao 4417779 5 110 m WL 75 60
## 5 wei_wei 4424967 7 98 f CD 60 NA
## 6 shuang_shuang 4424657 8 110 m WL 92 NA
## 7 qiuqiu 4423758 9 130 f CD 81 50
## 8 lang_lang 4427758 10 110 m WL 60 NA
## genetic_value3 year
## 1 60 2017
## 2 80 2017
## 3 65 2017
## 4 76 2017
## 5 92 2017
## 6 50 2017
## 7 30 2018
## 8 70 2018
# select rows that have age>5 OR weight_kg >100
filter(panda_data, age > 5 | weight_kg > 100)
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 mao_mao 4415522 4 120 f CD 70 30
## 2 lan_lan 4416073 7 95 f WL 80 NA
## 3 bei_bei 4416405 5 120 f WL 80 NA
## 4 bao_bao 4417779 5 110 m WL 75 60
## 5 wei_wei 4424967 7 98 f CD 60 NA
## 6 shuang_shuang 4424657 8 110 m WL 92 NA
## 7 qiuqiu 4423758 9 130 f CD 81 50
## 8 lang_lang 4427758 10 110 m WL 60 NA
## genetic_value3 year
## 1 90 2017
## 2 80 2017
## 3 65 2017
## 4 76 2017
## 5 92 2017
## 6 50 2017
## 7 30 2018
## 8 70 2018
# select rows that have age>5 AND base column has CD has entry
filter(panda_data, age > 5 & base == "CD")
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 wei_wei 4424967 7 98 f CD 60 NA
## 2 qiuqiu 4423758 9 130 f CD 81 50
## genetic_value3 year
## 1 92 2017
## 2 30 2018
# Select rows where the panda age is defined by a few values. Note the use of the %in% function.
filter(panda_data, age %in% c(4,5,7))
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 lan_lan 4416073 7 95 f WL 80 NA
## 4 bei_bei 4416405 5 120 f WL 80 NA
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## 7 wei_wei 4424967 7 98 f CD 60 NA
## genetic_value3 year
## 1 60 2017
## 2 90 2017
## 3 80 2017
## 4 65 2017
## 5 76 2017
## 6 87 2017
## 7 92 2017
Questions:
1: Select rows with NA in the genetic_value2
column.
2: Select rows whose panda_name column are bao_bao or
bei_bei.
# Select rows with `NA` in the genetic_value2 column.
filter(panda_data, is.na(genetic_value2))
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 lan_lan 4416073 7 95 f WL 80 NA
## 2 bei_bei 4416405 5 120 f WL 80 NA
## 3 wei_wei 4424967 7 98 f CD 60 NA
## 4 shuang_shuang 4424657 8 110 m WL 92 NA
## 5 lang_lang 4427758 10 110 m WL 60 NA
## genetic_value3 year
## 1 80 2017
## 2 65 2017
## 3 92 2017
## 4 50 2017
## 5 70 2018
# Select rows whose panda_name column are `bao_bao` or `bei_bei`
filter(panda_data, panda_name == 'bao_bao' | panda_name == 'bei_bei')
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 bei_bei 4416405 5 120 f WL 80 NA
## 2 bao_bao 4417779 5 110 m WL 75 60
## genetic_value3 year
## 1 65 2017
## 2 76 2017
Piping (%>%) allows the user to combine the output
from one function to the input of another. Thus, instead of nesting
functions (reading from the inside to the outside), piping reads
functions from left to right (i.e., the way we normally read things). As
a result, reading piped code can be more intuitive and can help avoid
creating and saving a lot of intermediate variables that you don’t
need.
# Use a simple pipe to select the panda name and it's sex, and output the result.
# Create a new object named 'pipe_result'
pipe_result <- panda_data %>%
select(panda_name, sex) %>%
head()
# Output the result to the screen
pipe_result
## panda_name sex
## 1 da_da m
## 2 mao_mao f
## 3 lan_lan f
## 4 bei_bei f
## 5 bao_bao m
## 6 tian_tian m
Questions:
# Select columns that contain genetic and then selection values based on these columns
panda_data %>%
select(contains("genetic")) %>%
filter(genetic_value1 > 80 & genetic_value2 < 90)
## genetic_value1 genetic_value2 genetic_value3
## 1 99 50 60
## 2 95 70 87
## 3 81 50 30
# Example of filtering a column that doesn't exist after selecting
# The field doesn't exist and will cause an error
# panda_data %>%
# select(contains("genetic")) %>%
# filter(weight_kg > 90)
One of the most useful functions in dplyr package is
mutate(). Mutate allows the user to create new column(s),
populating the columns with values that you define or from information
in columns that already exist. You can also use mutate() to
control which columns are retained in the new object that you create by
setting the .keep argument.
# Create a new column, based on values from other columns that exist
# By default, keep = all
new_col_ex1 <- panda_data %>%
mutate(genetic_value_new = genetic_value1 - genetic_value2,
.keep = "all")
# Print
new_col_ex1
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 lan_lan 4416073 7 95 f WL 80 NA
## 4 bei_bei 4416405 5 120 f WL 80 NA
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## 7 wei_wei 4424967 7 98 f CD 60 NA
## 8 shuang_shuang 4424657 8 110 m WL 92 NA
## 9 qiuqiu 4423758 9 130 f CD 81 50
## 10 lang_lang 4427758 10 110 m WL 60 NA
## genetic_value3 year genetic_value_new
## 1 60 2017 49
## 2 90 2017 40
## 3 80 2017 NA
## 4 65 2017 NA
## 5 76 2017 15
## 6 87 2017 25
## 7 92 2017 NA
## 8 50 2017 NA
## 9 30 2018 31
## 10 70 2018 NA
# You can create multiple columns at once. Best to put each new column on a separate line. This simply makes the code more readable.
new_col_ex2 <- panda_data %>%
mutate(genetic_dif = genetic_value1 - genetic_value2,
weight_g = weight_kg * 1000)
# Print
new_col_ex2
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 lan_lan 4416073 7 95 f WL 80 NA
## 4 bei_bei 4416405 5 120 f WL 80 NA
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## 7 wei_wei 4424967 7 98 f CD 60 NA
## 8 shuang_shuang 4424657 8 110 m WL 92 NA
## 9 qiuqiu 4423758 9 130 f CD 81 50
## 10 lang_lang 4427758 10 110 m WL 60 NA
## genetic_value3 year genetic_dif weight_g
## 1 60 2017 49 100000
## 2 90 2017 40 120000
## 3 80 2017 NA 95000
## 4 65 2017 NA 120000
## 5 76 2017 15 110000
## 6 87 2017 25 89000
## 7 92 2017 NA 98000
## 8 50 2017 NA 110000
## 9 30 2018 31 130000
## 10 70 2018 NA 110000
Question:
panda_data called
zero and give it a value of 0?# Create a new column called zero
panda_data %>%
mutate(zero = 0)
## panda_name ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 lan_lan 4416073 7 95 f WL 80 NA
## 4 bei_bei 4416405 5 120 f WL 80 NA
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## 7 wei_wei 4424967 7 98 f CD 60 NA
## 8 shuang_shuang 4424657 8 110 m WL 92 NA
## 9 qiuqiu 4423758 9 130 f CD 81 50
## 10 lang_lang 4427758 10 110 m WL 60 NA
## genetic_value3 year zero
## 1 60 2017 0
## 2 90 2017 0
## 3 80 2017 0
## 4 65 2017 0
## 5 76 2017 0
## 6 87 2017 0
## 7 92 2017 0
## 8 50 2017 0
## 9 30 2018 0
## 10 70 2018 0
It is common to receive data in ways that we want to change or alter.
For instance, we might want to change the order of the columns, re-order
the rows based on particular column values, or we may want to change the
column headings. Mutate() includes options to change the
order of columns that you create (see .before and
.after in the help for the function).
Relocate() can be used move columns around using the
.before and .after commands,
arrange() can be used to order the rows, and
rename() can be used to give new names to column headings.
By default, arrange will arrange the selecting column in
ascending order. We can use desc() to arrange the column in
descending order.
# Arrange the panda_data by panda_name in descending order
# Place the sex column after the ID column
order_data_ex1 <- panda_data %>%
arrange(desc(panda_name)) %>%
relocate(sex, .after = ID)
# Print
head(order_data_ex1)
## panda_name ID sex age weight_kg base genetic_value1 genetic_value2
## 1 wei_wei 4424967 f 7 98 CD 60 NA
## 2 tian_tian 4424490 m 4 89 WL 95 70
## 3 shuang_shuang 4424657 m 8 110 WL 92 NA
## 4 qiuqiu 4423758 f 9 130 CD 81 50
## 5 mao_mao 4415522 f 4 120 CD 70 30
## 6 lang_lang 4427758 m 10 110 WL 60 NA
## genetic_value3 year
## 1 92 2017
## 2 87 2017
## 3 50 2017
## 4 30 2018
## 5 90 2017
## 6 70 2018
# Arrange the panda_data by genetic_value 1 and genetic_value2 after select the columns that start with 'genetic'
# Rename the genetic_value1 column to gen_val1
order_data_ex2 <- panda_data %>%
select(starts_with("genetic")) %>%
arrange(genetic_value1, genetic_value2) %>%
rename(gen_val1 = genetic_value1)
# Print
head(order_data_ex2)
## gen_val1 genetic_value2 genetic_value3
## 1 60 NA 92
## 2 60 NA 70
## 3 70 30 90
## 4 75 60 76
## 5 80 NA 80
## 6 80 NA 65
Question:
panda_wgt_kg.# Arrange the panda dataset by age and weight, relocating columns and renaming the weight column
panda_data %>%
arrange(age, weight_kg) %>%
relocate(year, .before = age) %>%
relocate(base, .after = last_col()) %>%
rename(panda_wgt_kg = weight_kg)
## panda_name ID year age panda_wgt_kg sex genetic_value1
## 1 tian_tian 4424490 2017 4 89 m 95
## 2 mao_mao 4415522 2017 4 120 f 70
## 3 da_da 4415463 2017 5 100 m 99
## 4 bao_bao 4417779 2017 5 110 m 75
## 5 bei_bei 4416405 2017 5 120 f 80
## 6 lan_lan 4416073 2017 7 95 f 80
## 7 wei_wei 4424967 2017 7 98 f 60
## 8 shuang_shuang 4424657 2017 8 110 m 92
## 9 qiuqiu 4423758 2018 9 130 f 81
## 10 lang_lang 4427758 2018 10 110 m 60
## genetic_value2 genetic_value3 base
## 1 70 87 WL
## 2 30 90 CD
## 3 50 60 CD
## 4 60 76 WL
## 5 NA 65 WL
## 6 NA 80 WL
## 7 NA 92 CD
## 8 NA 50 WL
## 9 50 30 CD
## 10 NA 70 WL
Summary statistics are easily calculated using the
summarise() function (note: summarize() will
also work). The summary statistics will be calculated among all rows in
the dataframe, unless specified otherwise. Summarize() is
often used in combination with group_by() (see below for
more information). Various useful summary functions are provided in the
help documentation for the function.
# Calculate the mean weight of the pandas and provide a count.
sum_table <- panda_data %>%
summarise(mean_wgt = mean(weight_kg),
n = n())
# Print summary table
sum_table
## mean_wgt n
## 1 108.2 10
# Calculate the avg weight and minimum weight. Don't create a new object.
panda_data %>%
summarise(avg_wt = mean(weight_kg),
min_wt = min(weight_kg))
## avg_wt min_wt
## 1 108.2 89
Grouping data together for class summaries has been improved in the
latest version of dplyr by adding .by to the
summarise() function. As a result, it is no longer
necessary to use the function Group_by() in addition and/or
external to summarise(). The .by argument
allows you to group your columns of interest and generate a series of
summary statistics.
# Similar to above, calculate average and minimum weight, but summarize based on 'base' column. Include a count of each group.
panda_data %>%
summarise(avg_wt = mean(weight_kg),
min_wt = min(weight_kg),
n = n(),
.by = base)
## base avg_wt min_wt n
## 1 CD 112.0000 98 4
## 2 WL 105.6667 89 6
# Group summaries can also be calculated across muultiple groups.
# Here, we calculate the same as above, but based on base and sex
panda_data %>%
summarise(avg_wt = mean(weight_kg),
min_wt= min(weight_kg),
n = n(),
.by = c(base, sex))
## base sex avg_wt min_wt n
## 1 CD m 100.00 100 1
## 2 CD f 116.00 98 3
## 3 WL f 107.50 95 2
## 4 WL m 104.75 89 4
The last of the important tools we will cover in dplyr
is how to join tables together. This is common practice in ecological
statistics, with data in one table that are required to be appended to
another table for analyses. To join these tables, we need to link the
tables based on shared columns.
As an example, we will import a table
(panda_data_med.csv) located in our Data
folder that summarizes the vaccination history of each our pandas.
Similar to other functions, we have multiple options to join the tables
together. The most common join to use is a left_join(). In
this case, all the records in our first table (i.e., the left table) are
linked with those that match or are shared with the second data table
(i.e., the right table). The means that we will keep all the records in
the left table and include only those records that match in the right
table (they will be returned NA if no match exists).
Other join options, including inner_join(),
right_join, and full_join() also exist. Read
the help file to determine which join is most appropriate
to accomplish the activity of interest.
Note: Sometimes the column headings between tables
do not match, even if the data within the column does (ID
in table 1 != id in table 2). In these cases, you can
either rename() the column headings to make them match or
specify columns you want to join.
# Read in the vaccination table
panda_med <- read.csv(file="Data/panda_data_med.csv")
# Join all the rows in table 1 (panda_data) with table 2 (panda_med) to determine which of the pandas were vaccinated. Since ID exists in both tables, this is a straightforward join.
# Arrange the result by ID and year vaccinated
panda_join_ex1 <- panda_data %>%
left_join(panda_med, by = "ID") %>%
arrange(ID, year_vaccination)
# Print
head(panda_join_ex1)
## panda_name.x ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 mao_mao 4415522 4 120 f CD 70 30
## 4 lan_lan 4416073 7 95 f WL 80 NA
## 5 bei_bei 4416405 5 120 f WL 80 NA
## 6 bao_bao 4417779 5 110 m WL 75 60
## genetic_value3 year panda_name.y year_vaccination vaccine_type
## 1 60 2017 da_da 2018 canine distemper
## 2 90 2017 mao_mao 2017 canine distemper
## 3 90 2017 mao_mao 2018 rabies
## 4 80 2017 <NA> NA <NA>
## 5 65 2017 <NA> NA <NA>
## 6 76 2017 <NA> NA <NA>
# Example of how to join based on multiple fields and fields that don't exactly match
panda_join_ex2 <- panda_data %>%
left_join(panda_med,
by = c("ID" = "ID",
"year"= "year_vaccination"))
# Print
head(panda_join_ex2)
## panda_name.x ID age weight_kg sex base genetic_value1 genetic_value2
## 1 da_da 4415463 5 100 m CD 99 50
## 2 mao_mao 4415522 4 120 f CD 70 30
## 3 lan_lan 4416073 7 95 f WL 80 NA
## 4 bei_bei 4416405 5 120 f WL 80 NA
## 5 bao_bao 4417779 5 110 m WL 75 60
## 6 tian_tian 4424490 4 89 m WL 95 70
## genetic_value3 year panda_name.y vaccine_type
## 1 60 2017 <NA> <NA>
## 2 90 2017 mao_mao canine distemper
## 3 80 2017 <NA> <NA>
## 4 65 2017 <NA> <NA>
## 5 76 2017 <NA> <NA>
## 6 87 2017 <NA> <NA>
Now that you’ve been exposed to a few functions to better manage data
in R, we will reinforce these tools by asking you to apply them to
memory by practicing on a dataset of wildlife counts collected across
the Naboisho conservancy. Located in your Data folder and
named SampleData.csv, these data are derived from our
publication in the African Journal of Ecology that forms the foundation
for this project.
Upon import you will note that the dataset is comprised of 5 columns:
Date: The date the transect was conductedSpecies: Species nameGroupSize: Total number of animals in the counted
groupTransectID: Name of each transectYear: Year in which the data were collectedPlease create a script named Exercise1_DataMgmt.R.
Format the script like we have instructed to do so in this lecture with
your Name, Date, and Description.
Then, read in the data and create a fully commented code to answer the
following questions:
Please use a %>% for the followings questions (you don’t need to save the output):
Note: We didn’t go over every function to answer
these questions. You may need to troubleshoot and search the web to find
the appropriate function. Remember to use R-Cran when
searching to limit/fine tune your search. Using dplyr in
your search term can also help in narrowing down your search.
The data used for this exercise can be referenced as:
Crego RD, Wells HBM, Connette G, Stabach JA, Soit N, Thompson S. 2023. Monitoring spatiotemporal dynamics of large herbivores across an African rangeland using hierarchical multi-species distance sampling. African Journal of Ecology. doi:10.1111/aje.13219
# Read in the CSV
survey_sample <- read.csv(file = "Data/SampleData.csv")
# Look at the data
head(survey_sample)
## Date Species GroupSize TransectID Year
## 1 2017-11-02 Hartebeest 8 Balanite plains 2017
## 2 2018-01-11 Hartebeest 1 Balanite plains 2018
## 3 2018-01-11 Hartebeest 2 Balanite plains 2018
## 4 2018-01-11 Hartebeest 2 Balanite plains 2018
## 5 2018-02-01 Hartebeest 1 Balanite plains 2018
## 6 2018-08-02 Hartebeest 5 Balanite plains 2018
# Question 1: What are the dimensions of the dataset? How many rows and columns?
dim(survey_sample)
## [1] 3679 5
nrow(survey_sample)
## [1] 3679
ncol(survey_sample)
## [1] 5
# Question 2: What is the data structure of each variable?
str(survey_sample)
## 'data.frame': 3679 obs. of 5 variables:
## $ Date : chr "2017-11-02" "2018-01-11" "2018-01-11" "2018-01-11" ...
## $ Species : chr "Hartebeest" "Hartebeest" "Hartebeest" "Hartebeest" ...
## $ GroupSize : int 8 1 2 2 1 5 1 2 1 1 ...
## $ TransectID: chr "Balanite plains" "Balanite plains" "Balanite plains" "Balanite plains" ...
## $ Year : int 2017 2018 2018 2018 2018 2018 2018 2017 2017 2018 ...
# Question 3: How many species were counted?
n_distinct(survey_sample$Species)
## [1] 10
length(unique(survey_sample$Species))
## [1] 10
# Question 4: How many transects are there and what are the names of these transects? Same process, but a different column.
n_distinct(survey_sample$TransectID)
## [1] 8
unique(survey_sample$TransectID)
## [1] "Balanite plains" "Rekero bushlands" "Enoolera plains"
## [4] "Olare Sampu bushlands" "Sampu Enkare bushlands" "Ilkisieusieu bushlands"
## [7] "Rekero plains" "Payia plains"
# Question 5: How many unique survey days were completed? What is the range of dates?
n_distinct(survey_sample$Date)
## [1] 37
length(unique(survey_sample$Date))
## [1] 37
min(survey_sample$Date)
## [1] "2017-10-17"
max(survey_sample$Date)
## [1] "2018-12-06"
range(survey_sample$Date)
## [1] "2017-10-17" "2018-12-06"
# Question 6: What is the maximum group size of giraffe?
survey_sample %>%
filter(Species == "Giraffe") %>%
summarise(max_grp = max(GroupSize))
## max_grp
## 1 12
# Question 7: What is the mean group size of wildebeest per year? Include the number of observations in the calculation.
survey_sample %>%
filter(Species == "Wildebeest") %>%
summarise(mean_grp = mean(GroupSize),
n_obs = n(), # This counts the observations, it is not a sum of the observations.
.by=Year)
## Year mean_grp n_obs
## 1 2017 8.305419 203
## 2 2018 10.794664 862
# Question 8: What is the total count of impala observed in 2018?
survey_sample %>%
filter(Species == "Impala" & Year == "2018") %>%
summarise(sum_impala = sum(GroupSize))
## sum_impala
## 1 2182
# Question 9: What is the total number of groups of each species per year? Note, not the total animals counted.
survey_sample %>%
summarise(Total_Grps = n(),
.by=c(Species, Year))
## Species Year Total_Grps
## 1 Hartebeest 2017 3
## 2 Hartebeest 2018 21
## 3 Eland 2017 2
## 4 Eland 2018 32
## 5 Giraffe 2017 15
## 6 Giraffe 2018 62
## 7 Grant's gazelle 2017 56
## 8 Grant's gazelle 2018 119
## 9 Impala 2017 158
## 10 Impala 2018 299
## 11 Thomson's gazelle 2017 297
## 12 Thomson's gazelle 2018 736
## 13 Topi 2017 60
## 14 Topi 2018 171
## 15 Warthog 2017 39
## 16 Warthog 2018 167
## 17 Wildebeest 2017 203
## 18 Wildebeest 2018 862
## 19 Plains zebra 2017 64
## 20 Plains zebra 2018 313
# Question 10: What is the total number of groups and the mean group size of each species, per year, and per transect? Sort these by Species, TransectID, and Year
survey_sample %>%
summarise(Total_Grps = n(),
Mean_Grps = mean(GroupSize),
.by = c(Species, Year, TransectID)) %>%
arrange(Species, TransectID, Year)
## Species Year TransectID Total_Grps Mean_Grps
## 1 Eland 2018 Balanite plains 4 5.500000
## 2 Eland 2018 Enoolera plains 4 3.000000
## 3 Eland 2018 Ilkisieusieu bushlands 2 28.500000
## 4 Eland 2017 Olare Sampu bushlands 1 7.000000
## 5 Eland 2018 Olare Sampu bushlands 5 5.600000
## 6 Eland 2017 Rekero bushlands 1 1.000000
## 7 Eland 2018 Rekero bushlands 3 2.666667
## 8 Eland 2018 Rekero plains 6 19.666667
## 9 Eland 2018 Sampu Enkare bushlands 8 7.625000
## 10 Giraffe 2017 Balanite plains 2 1.500000
## 11 Giraffe 2018 Balanite plains 5 2.800000
## 12 Giraffe 2018 Enoolera plains 2 2.000000
## 13 Giraffe 2017 Ilkisieusieu bushlands 2 5.500000
## 14 Giraffe 2018 Ilkisieusieu bushlands 18 1.888889
## 15 Giraffe 2017 Olare Sampu bushlands 6 2.166667
## 16 Giraffe 2018 Olare Sampu bushlands 9 2.888889
## 17 Giraffe 2018 Payia plains 2 4.500000
## 18 Giraffe 2017 Rekero bushlands 3 2.000000
## 19 Giraffe 2018 Rekero bushlands 11 2.363636
## 20 Giraffe 2018 Rekero plains 3 5.666667
## 21 Giraffe 2017 Sampu Enkare bushlands 2 1.000000
## 22 Giraffe 2018 Sampu Enkare bushlands 12 3.000000
## 23 Grant's gazelle 2017 Balanite plains 15 4.133333
## 24 Grant's gazelle 2018 Balanite plains 26 3.923077
## 25 Grant's gazelle 2017 Enoolera plains 9 1.777778
## 26 Grant's gazelle 2018 Enoolera plains 16 2.937500
## 27 Grant's gazelle 2017 Ilkisieusieu bushlands 1 5.000000
## 28 Grant's gazelle 2018 Ilkisieusieu bushlands 9 3.000000
## 29 Grant's gazelle 2017 Olare Sampu bushlands 12 4.416667
## 30 Grant's gazelle 2018 Olare Sampu bushlands 10 3.500000
## 31 Grant's gazelle 2018 Payia plains 7 2.714286
## 32 Grant's gazelle 2017 Rekero bushlands 11 2.454545
## 33 Grant's gazelle 2018 Rekero bushlands 20 2.750000
## 34 Grant's gazelle 2017 Rekero plains 4 1.250000
## 35 Grant's gazelle 2018 Rekero plains 11 1.909091
## 36 Grant's gazelle 2017 Sampu Enkare bushlands 4 2.750000
## 37 Grant's gazelle 2018 Sampu Enkare bushlands 20 2.600000
## 38 Hartebeest 2017 Balanite plains 1 8.000000
## 39 Hartebeest 2018 Balanite plains 6 2.000000
## 40 Hartebeest 2018 Enoolera plains 1 1.000000
## 41 Hartebeest 2017 Rekero bushlands 2 1.500000
## 42 Hartebeest 2018 Rekero bushlands 14 3.214286
## 43 Impala 2017 Balanite plains 22 5.454545
## 44 Impala 2018 Balanite plains 29 4.689655
## 45 Impala 2017 Enoolera plains 9 2.111111
## 46 Impala 2018 Enoolera plains 11 5.090909
## 47 Impala 2017 Ilkisieusieu bushlands 11 17.545455
## 48 Impala 2018 Ilkisieusieu bushlands 40 7.625000
## 49 Impala 2017 Olare Sampu bushlands 30 3.166667
## 50 Impala 2018 Olare Sampu bushlands 76 7.263158
## 51 Impala 2017 Payia plains 1 100.000000
## 52 Impala 2018 Payia plains 7 22.285714
## 53 Impala 2017 Rekero bushlands 59 5.915254
## 54 Impala 2018 Rekero bushlands 67 8.716418
## 55 Impala 2017 Rekero plains 5 6.400000
## 56 Impala 2018 Rekero plains 10 3.800000
## 57 Impala 2017 Sampu Enkare bushlands 21 3.142857
## 58 Impala 2018 Sampu Enkare bushlands 59 6.016949
## 59 Plains zebra 2017 Balanite plains 19 2.894737
## 60 Plains zebra 2018 Balanite plains 27 21.407407
## 61 Plains zebra 2017 Enoolera plains 6 5.666667
## 62 Plains zebra 2018 Enoolera plains 22 4.500000
## 63 Plains zebra 2017 Ilkisieusieu bushlands 11 2.454545
## 64 Plains zebra 2018 Ilkisieusieu bushlands 46 9.500000
## 65 Plains zebra 2017 Olare Sampu bushlands 1 11.000000
## 66 Plains zebra 2018 Olare Sampu bushlands 42 5.380952
## 67 Plains zebra 2017 Payia plains 12 2.750000
## 68 Plains zebra 2018 Payia plains 60 5.016667
## 69 Plains zebra 2017 Rekero bushlands 5 2.000000
## 70 Plains zebra 2018 Rekero bushlands 58 5.568966
## 71 Plains zebra 2017 Rekero plains 5 3.200000
## 72 Plains zebra 2018 Rekero plains 12 8.916667
## 73 Plains zebra 2017 Sampu Enkare bushlands 5 4.800000
## 74 Plains zebra 2018 Sampu Enkare bushlands 46 5.152174
## 75 Thomson's gazelle 2017 Balanite plains 58 5.103448
## 76 Thomson's gazelle 2018 Balanite plains 108 6.796296
## 77 Thomson's gazelle 2017 Enoolera plains 46 7.369565
## 78 Thomson's gazelle 2018 Enoolera plains 111 7.540541
## 79 Thomson's gazelle 2017 Ilkisieusieu bushlands 12 4.250000
## 80 Thomson's gazelle 2018 Ilkisieusieu bushlands 52 6.538462
## 81 Thomson's gazelle 2017 Olare Sampu bushlands 13 7.615385
## 82 Thomson's gazelle 2018 Olare Sampu bushlands 44 4.386364
## 83 Thomson's gazelle 2017 Payia plains 69 4.855072
## 84 Thomson's gazelle 2018 Payia plains 169 6.923077
## 85 Thomson's gazelle 2017 Rekero bushlands 17 4.647059
## 86 Thomson's gazelle 2018 Rekero bushlands 30 4.933333
## 87 Thomson's gazelle 2017 Rekero plains 55 4.327273
## 88 Thomson's gazelle 2018 Rekero plains 106 7.669811
## 89 Thomson's gazelle 2017 Sampu Enkare bushlands 27 4.037037
## 90 Thomson's gazelle 2018 Sampu Enkare bushlands 116 6.362069
## 91 Topi 2017 Balanite plains 5 2.800000
## 92 Topi 2018 Balanite plains 33 4.606061
## 93 Topi 2017 Enoolera plains 11 2.363636
## 94 Topi 2018 Enoolera plains 26 2.230769
## 95 Topi 2017 Ilkisieusieu bushlands 1 4.000000
## 96 Topi 2018 Ilkisieusieu bushlands 7 3.285714
## 97 Topi 2017 Olare Sampu bushlands 3 1.333333
## 98 Topi 2018 Olare Sampu bushlands 17 2.882353
## 99 Topi 2018 Payia plains 1 1.000000
## 100 Topi 2017 Rekero bushlands 32 2.875000
## 101 Topi 2018 Rekero bushlands 46 1.891304
## 102 Topi 2017 Rekero plains 2 1.000000
## 103 Topi 2018 Rekero plains 8 2.750000
## 104 Topi 2017 Sampu Enkare bushlands 6 1.833333
## 105 Topi 2018 Sampu Enkare bushlands 33 1.848485
## 106 Warthog 2017 Balanite plains 10 1.900000
## 107 Warthog 2018 Balanite plains 31 2.419355
## 108 Warthog 2017 Enoolera plains 3 4.000000
## 109 Warthog 2018 Enoolera plains 16 2.500000
## 110 Warthog 2017 Ilkisieusieu bushlands 4 2.500000
## 111 Warthog 2018 Ilkisieusieu bushlands 13 2.307692
## 112 Warthog 2017 Olare Sampu bushlands 8 2.000000
## 113 Warthog 2018 Olare Sampu bushlands 44 2.704545
## 114 Warthog 2017 Payia plains 3 1.333333
## 115 Warthog 2018 Payia plains 8 2.500000
## 116 Warthog 2017 Rekero bushlands 3 1.000000
## 117 Warthog 2018 Rekero bushlands 11 1.909091
## 118 Warthog 2017 Rekero plains 1 1.000000
## 119 Warthog 2018 Rekero plains 1 1.000000
## 120 Warthog 2017 Sampu Enkare bushlands 7 2.000000
## 121 Warthog 2018 Sampu Enkare bushlands 43 3.395349
## 122 Wildebeest 2017 Balanite plains 9 17.000000
## 123 Wildebeest 2018 Balanite plains 54 8.851852
## 124 Wildebeest 2017 Enoolera plains 31 8.387097
## 125 Wildebeest 2018 Enoolera plains 113 9.761062
## 126 Wildebeest 2017 Ilkisieusieu bushlands 3 8.333333
## 127 Wildebeest 2018 Ilkisieusieu bushlands 42 14.690476
## 128 Wildebeest 2017 Olare Sampu bushlands 25 15.400000
## 129 Wildebeest 2018 Olare Sampu bushlands 121 11.619835
## 130 Wildebeest 2017 Payia plains 36 6.777778
## 131 Wildebeest 2018 Payia plains 241 8.780083
## 132 Wildebeest 2017 Rekero bushlands 28 5.821429
## 133 Wildebeest 2018 Rekero bushlands 73 28.452055
## 134 Wildebeest 2017 Rekero plains 18 4.055556
## 135 Wildebeest 2018 Rekero plains 126 6.007937
## 136 Wildebeest 2017 Sampu Enkare bushlands 53 7.226415
## 137 Wildebeest 2018 Sampu Enkare bushlands 92 8.163043